@asteurer (Collaborator) commented Sep 8, 2025

When you get a chance, will you review and let me know if this is good to merge?

@asteurer (Collaborator, Author) commented Oct 9, 2025

Closes #8 (probably)

@calebschoepp (Owner) left a comment

Ran this locally and it worked. Very exciting!! I think we have quite a bit of iteration to do before this is ready. My first concern is getting the WIT looking right; the Spin and opentelemetry-wasi implementations will then fall out of that.

Still loading all this stuff back into my head after a long time away so this is just a preliminary bout of comments and questions to get us started.

It might speed up this whole process if you could in a paragraph or two provide some color on how you ended up with the WIT that you have right now so I can have some more context.

Good work. Excited to see all this.

@asteurer (Collaborator, Author) commented Oct 15, 2025

My process for building this came from looking closely at the opentelemetry-rust implementation for a ResourceMetrics type and attempting to translate it into WIT. I didn't spend a lot of time looking at other language implementations of ResourceMetrics, so that might be a good next step for further refining the WIT.

Let me know if you need more context.

@asteurer (Collaborator, Author) commented

In case there are concerns about me referencing an unstable API for metrics, looks like the folks in the otel-rust group in the CNCF slack declared metrics and logs stable: https://cloud-native.slack.com/archives/C03GDP0H023/p1759196279655829?thread_ts=1759196007.336079&cid=C03GDP0H023

@calebschoepp (Owner) commented

Spent some time reading the spec[1], looking at what you have, and pondering. I think it would be useful to back up and look at this from a 10,000 foot view.

10,000 foot view

At the highest level we have a component that has a bunch of metrics that need to make it out of the component (typically into the host runtime, but maybe a parent component in the case of composition). There are two ways we can do this: push or pull. Push means the guest decides when to send its metrics to the host. For example, this is how wasi-otel tracing works. Pull means that the host decides when to get metrics from a guest.

Pull

Pull means that the host decides when to get metrics from a guest. Some pseudo-WIT might look like:

export collect: func() -> result<metrics, error>;

The guest would provide an implementation of this collect function and the host would get to call it whenever it wants[2] to collect the guest metrics.

On the opentelemetry-wasi side of things (ref) we would likely model it as something like WasiPullMetricExporter that implements the MetricReader trait/interface.

I imagine a lot could be said on why there are pros and cons to the pull architecture, but the best I have right now is that my gut is telling me we should explore push first. I imagine in the future we may need to support both push and pull for different use cases.
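As a rough illustration of the pull shape described above, here is a minimal std-only Rust sketch. Every name in it (Metric, Collect, Guest, host_poll) is a hypothetical stand-in invented for this example; none of them come from wasi-otel, Spin, or opentelemetry-rust, and the real bindings would be generated from the WIT.

```rust
// Hypothetical stand-ins only: these types are not part of wasi-otel or
// opentelemetry-rust. They model the pseudo-WIT
// `export collect: func() -> result<metrics, error>`.

#[derive(Debug, Clone, PartialEq)]
pub struct Metric {
    pub name: String,
    pub value: u64,
}

/// Guest-side export: the guest implements this, the host calls it.
pub trait Collect {
    fn collect(&mut self) -> Result<Vec<Metric>, String>;
}

/// A toy guest that aggregates a single counter in its own memory.
pub struct Guest {
    requests: u64,
}

impl Guest {
    pub fn new() -> Self {
        Guest { requests: 0 }
    }

    /// Recording a measurement never crosses the host/guest boundary.
    pub fn handle_request(&mut self) {
        self.requests += 1;
    }
}

impl Collect for Guest {
    fn collect(&mut self) -> Result<Vec<Metric>, String> {
        Ok(vec![Metric {
            name: "requests".to_string(),
            value: self.requests,
        }])
    }
}

/// Host side: the host alone decides when this runs.
pub fn host_poll(guest: &mut dyn Collect) -> Result<Vec<Metric>, String> {
    guest.collect()
}
```

In this sketch the guest only updates local state; the single boundary crossing happens when the host chooses to call host_poll.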

Push

Push means the guest decides when to send its metrics to the host. Some pseudo-WIT might look like:

import export: func(my-metrics: metrics) -> result<_, error>;

The most naive way we could model this in opentelemetry-wasi is to have each instrument export its data immediately when it is called, bypassing the aggregation that normally occurs within the OpenTelemetry SDK for metrics. This is potentially simpler, but I don't think we should do this: metrics are commonly used in performance-critical parts of code, where you don't want to be jumping between the host and guest all the time.
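To make the cost difference concrete, here is a small std-only Rust sketch that counts boundary crossings. Host is a hypothetical stand-in for the imported export function, and both counter functions are invented for illustration; this is not how opentelemetry-wasi is implemented.

```rust
/// Hypothetical stand-in for the host side of the `export` import.
/// `calls` counts how many times the guest crossed into the host.
#[derive(Default)]
pub struct Host {
    pub calls: u32,
    pub total: u64,
}

impl Host {
    pub fn export(&mut self, value: u64) -> Result<(), String> {
        self.calls += 1;
        self.total += value;
        Ok(())
    }
}

/// Naive push: every instrument call turns into one host call.
pub fn naive_counter(host: &mut Host, increments: &[u64]) -> Result<(), String> {
    for &n in increments {
        host.export(n)?;
    }
    Ok(())
}

/// Aggregating push: sum inside the guest, then cross the boundary once.
pub fn aggregated_counter(host: &mut Host, increments: &[u64]) -> Result<(), String> {
    let sum: u64 = increments.iter().sum();
    host.export(sum)
}
```

Both functions deliver the same total to the host, but the naive version makes one host call per increment while the aggregating version makes one in total.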

A less naive way to model this would be to have something like WasiPushMetricExporter that implements the MetricExporter trait/interface. A MetricExporter doesn't have a way to get at the aggregated metrics that the OTel SDK is holding, though, so we need to combine it with a MetricReader that can. In normal OTel land you would do this with some kind of PeriodicReader that embeds a MetricExporter. On a regular interval it would read the metrics and push them out with the exporter.

In our pre-wasi-p3 world, though, it is hard to build a PeriodicReader because we don't really have threading and async is a bit of a nightmare. We could use the ManualReader, but it doesn't embed a MetricExporter and automatically do the thing out of the box. This leaves us two options:

  1. Create a CustomManualReaderThatIsToUnblockUsButTheoreticallyUsefulElsewhere that is just a ManualReader but embeds a MetricExporter in it and exports when you run collect.
  2. Tell the consumer of opentelemetry-wasi to wire up a ManualReader and the WasiPushMetricExporter themselves.

I don't know which is the right choice. It's worth noting that in either pattern we would basically be telling the user to make sure they run some version of exporter.export(reader.collect()) at the end of their component code. This means we're hosed for component dependencies, because how does the component dependency know when to export?[3]
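Option 1 above could look roughly like the following std-only Rust sketch. The trait and types here are simplified, hypothetical stand-ins: the real MetricExporter and ManualReader in opentelemetry-rust have different signatures, and in the real SDK the aggregated state lives in the SDK pipeline rather than in the reader. RecordingExporter plays the role of WasiPushMetricExporter but only records what it is given.

```rust
use std::collections::BTreeMap;

/// Simplified stand-in for the SDK's aggregated metric data.
pub type Metrics = BTreeMap<String, u64>;

/// Simplified stand-in for the SDK's exporter trait (assumed shape).
pub trait MetricExporter {
    fn export(&mut self, metrics: &Metrics) -> Result<(), String>;
}

/// Plays the role of WasiPushMetricExporter, but records each export
/// instead of calling a host import.
#[derive(Default)]
pub struct RecordingExporter {
    pub exported: Vec<Metrics>,
}

impl MetricExporter for RecordingExporter {
    fn export(&mut self, metrics: &Metrics) -> Result<(), String> {
        self.exported.push(metrics.clone());
        Ok(())
    }
}

/// Option 1: a manual reader that embeds an exporter and pushes on collect.
pub struct ExportingManualReader<E: MetricExporter> {
    aggregated: Metrics,
    exporter: E,
}

impl<E: MetricExporter> ExportingManualReader<E> {
    pub fn new(exporter: E) -> Self {
        Self { aggregated: Metrics::new(), exporter }
    }

    /// Instrument calls only touch guest-local aggregated state.
    pub fn add(&mut self, name: &str, n: u64) {
        *self.aggregated.entry(name.to_string()).or_insert(0) += n;
    }

    /// The one call the user makes at the end of their component code.
    pub fn collect(&mut self) -> Result<(), String> {
        self.exporter.export(&self.aggregated)
    }

    pub fn exporter(&self) -> &E {
        &self.exporter
    }
}
```

With this shape the user writes a single reader.collect() at the end of the component. Under option 2 they would hold the reader and exporter separately and write the exporter.export(reader.collect()) step themselves.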

The world is complicated

Here are a couple of things, in no particular order, that complicate the design space:

  • WASI P3 is close, but not fully here yet. Given this I don't really want to design for a P2 world, but it is hard to design for the P3 world that is not fully here yet.
  • Theoretically we want wasi-otel to support all Wasm use cases, e.g. long-running components, component dependencies, etc. In practice we should probably just design for a simpler FaaS, single-instance kind of use case and then expand from there.
  • We have approximately no input from end users on how they want to do metrics (push/pull) to inform us. This is another thing pushing us toward action and iteration rather than trying to find the perfect solution.

Looking at your current implementation

Let's interpret your current implementation through the lens I've laid out so far.

It seems like you've modelled it off of opentelemetry-prometheus (which is pull) but made it push. Best I can tell WasiMetricExporter doesn't actually do anything. WasiMetricCollector is basically acting as a MetricExporter but doesn't actually implement the push trait.

My understanding of the necessity of WasiMetricExporter may be wrong though so I would love to be corrected.

What do I think we should do

We don't have all the info to make a perfect decision, but here is what I think we should do.

  • Go for the push pattern.
    • Play around to see if option 1 or 2 is more ergonomic with regards to readers.
  • Once working get it merged.
  • That's probably enough progress to unblock landing this stuff in Spin and making phase progress with the wasi proposal.
  • In the background keep exploring to see what a pull-based implementation would look like.

Footnotes

  1. All of API, data model, and SDK, but SDK was most relevant

  2. In practice wasi-otel would likely outline semantic conventions of when a host should poll for metrics.

  3. This will eventually be fixed by WASI P3, or works in the pull model, or you could just always push which is also a kind of sucky solution.

@asteurer force-pushed the rust-metrics branch 2 times, most recently from 97a1dd7 to 6435fa9, October 16, 2025 23:28
@asteurer asteurer requested a review from calebschoepp October 16, 2025 23:28
@asteurer (Collaborator, Author) commented Oct 16, 2025

Edit: No longer relevant

Signed-off-by: Andrew Steurer <[email protected]>
@calebschoepp (Owner) left a comment

Still not a full review, but going to stop here b/c I don't want to get into the nits until we address this core stuff

Comment on lines +7 to +8
/// `collect` gathers all metric data related to a Reader from the SDK
collect: func(metrics: resource-metrics) -> result<_, otel-error>;
@calebschoepp (Owner) commented

I would expect this to be called export now.

Comment on lines +262 to +266
variant otel-error {
    already-shutdown,
    timeout(duration),
    internal-failure(string),
}
@calebschoepp (Owner) commented

I'm skeptical that we're actually observing this set of errors with our setup. Are you sure this is the exact set of errors that are possible and that we want to bake into the WIT? Or should we just have the error be a string for now?

Comment on lines +259 to +262
/// The WASI representation of the `OTelSdkError`.
///
/// See https://github.com/open-telemetry/opentelemetry-rust/blob/353bbb0d80fc35a26a00b4f4fed0dcaed23e5523/opentelemetry-sdk/src/error.rs#L15
variant otel-error {
@calebschoepp (Owner) commented

Nit: Redundant to say something is a WASI X in a WIT file.

Nit: Don't love us linking to the Rust SDK as opposed to some canonical OTel reference spec page.

Comment on lines +47 to +55
/// Aggregated metrics data from an instrument.
variant aggregated-metrics {
    /// All metric data with `f64` value type.
    %f64(metric-data),
    /// All metric data with `u64` value type.
    %u64(metric-data),
    /// All metric data with `s64` value type.
    %s64(metric-data),
}
@calebschoepp (Owner) commented

I don't really understand why this variant is necessary.

}
}

impl MetricReader for WasiMetricReader {
@calebschoepp (Owner) commented

Maybe stick the comment I suggested about why we have a manual reader here to explain all these simple impls 🤷 ?

Comment on lines +68 to +76
fn temporality(&self, kind: InstrumentKind) -> Temporality {
    match kind {
        InstrumentKind::ObservableCounter
        | InstrumentKind::ObservableGauge
        | InstrumentKind::ObservableUpDownCounter => {
            panic!("Async InstrumentKinds are not yet supported");
        }
        _ => self.reader.temporality(kind),
    }
@calebschoepp (Owner) commented

Why aren't they supported?

@asteurer (Collaborator, Author) commented

My understanding of observable instruments is that they periodically export in the background. Using a ManualReader to export observable instruments would mean that metrics would be sent all at once, effectively making them non-observable counters. I think it would be confusing for people to be able to use observable counters and not have them work how they expect.

@@ -0,0 +1,17 @@
[package]
@calebschoepp (Owner) commented

Do you think it would be nicer for the tracing and metrics examples to live in the same example?

@asteurer (Collaborator, Author) commented

I'm not sure! Part of me likes the clear separation of examples; however, I can see how keeping everything together might make for a more interesting example. I'm open to either.
